This article explains why data validation matters in a machine learning pipeline and demonstrates how to validate data with TensorFlow Data Validation (TFDV). It covers five stages of data validation: generating statistics from the training data, inferring a schema from the training data, generating statistics for the evaluation data and comparing them with the training statistics, identifying and fixing anomalies, and checking for drift and data skew.
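To make the final stage concrete, here is a minimal, library-free sketch of the kind of comparison a drift or skew check performs on a categorical feature. TFDV's documentation describes drift for categorical features in terms of L-infinity distance; the code below only illustrates that idea in plain Python and is not TFDV's API. The feature values, the `DRIFT_THRESHOLD` value, and the helper names are all hypothetical.

```python
from collections import Counter


def value_distribution(values):
    """Return the relative frequency of each distinct value."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(train_values, eval_values):
    """Largest absolute difference in relative frequency over all values."""
    train_dist = value_distribution(train_values)
    eval_dist = value_distribution(eval_values)
    all_values = set(train_dist) | set(eval_dist)
    return max(
        abs(train_dist.get(v, 0.0) - eval_dist.get(v, 0.0)) for v in all_values
    )


# Hypothetical values of one categorical feature in two datasets.
train_payment = ["card"] * 6 + ["cash"] * 4
eval_payment = ["card"] * 3 + ["cash"] * 5 + ["voucher"] * 2

# Hypothetical threshold; in practice it is tuned per feature.
DRIFT_THRESHOLD = 0.1

distance = l_infinity_distance(train_payment, eval_payment)

# Values absent from training are the kind of thing a schema check flags.
unseen_values = set(eval_payment) - set(train_payment)

drift_detected = distance > DRIFT_THRESHOLD
print(f"distance={distance:.2f} unseen={unseen_values} drift={drift_detected}")
```

In this toy data the share of `card` falls from 0.6 to 0.3, so the distance is about 0.3 and exceeds the assumed threshold, while `voucher` shows up as a value the training data never contained.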
The article discusses the challenges of evaluating anomaly detection in time series data and introduces Proximity-Aware Time series anomaly Evaluation (PATE) as a solution. PATE provides a weighted version of the Precision-Recall curve and accounts for temporal correlations and buffer zones around anomalies, giving a more accurate and nuanced evaluation.
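The idea of weighting predictions by their proximity to an anomaly can be illustrated with a small sketch. This is a deliberately simplified, hypothetical scheme and not the PATE algorithm itself: it assumes a single symmetric buffer with linearly decaying weights and computes one precision/recall pair rather than a full weighted curve. The function names and the `buffer_size` parameter are assumptions made for the example.

```python
def proximity_weights(labels, buffer_size):
    """Weight 1.0 on anomalous points, decaying linearly to 0 past the buffer."""
    anomaly_idx = [i for i, y in enumerate(labels) if y == 1]
    weights = []
    for i in range(len(labels)):
        if not anomaly_idx:
            weights.append(0.0)
            continue
        distance = min(abs(i - j) for j in anomaly_idx)
        weights.append(max(0.0, 1.0 - distance / (buffer_size + 1)))
    return weights


def weighted_precision_recall(labels, predictions, buffer_size):
    """Simplified proximity-weighted precision and recall (not the PATE metric)."""
    weights = proximity_weights(labels, buffer_size)
    # A predicted point earns partial true-positive credit from its weight
    # and is counted as a false positive for the remainder.
    tp = sum(w for w, p in zip(weights, predictions) if p == 1)
    fp = sum(1.0 - w for w, p in zip(weights, predictions) if p == 1)
    # Anomalous points that were not predicted count as full misses.
    fn = sum(1.0 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# Hypothetical series: one anomaly at indices 3-4; the detector fires at
# index 4 (hit), index 5 (just after the anomaly) and index 9 (far away).
labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
predictions = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]

precision, recall = weighted_precision_recall(labels, predictions, buffer_size=2)
strict_precision, strict_recall = weighted_precision_recall(
    labels, predictions, buffer_size=0
)
print(f"buffered: precision={precision:.3f} recall={recall:.3f}")
print(f"strict:   precision={strict_precision:.3f} recall={strict_recall:.3f}")
```

With `buffer_size=0` the sketch reduces to ordinary point-wise scoring (precision 1/3, recall 1/2 on this data). With a buffer of 2, the slightly late detection at index 5 earns partial credit, raising precision to 5/9 and recall to 5/8, while the distant alarm at index 9 is still fully penalized.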